
Conversation

Contributor

@mhamilton723 mhamilton723 commented Sep 18, 2017

  • Replace old readers with new performant Dataframe readers
  • Remove all references to DF.rdd.mapPartitions

@mhamilton723 mhamilton723 force-pushed the streaming branch 3 times, most recently from 6e55de4 to 5a0dfa8 Compare September 18, 2017 20:22
@mhamilton723 mhamilton723 changed the title Refactor Image Reader Refactor MMLSpark for Structured Streaming Sep 19, 2017
@mhamilton723 mhamilton723 force-pushed the streaming branch 6 times, most recently from f4e93cb to 20d048e Compare September 20, 2017 02:44
@mhamilton723 mhamilton723 force-pushed the streaming branch 2 times, most recently from fc00299 to b87a9e6 Compare September 20, 2017 23:52
@@ -1,29 +0,0 @@
// Copyright (C) Microsoft Corporation. All rights reserved.
Contributor

Please keep these in their original places since we'll be moving to Spark Images soon anyway.

.queryName("images")
.start()

Thread.sleep(3000)
Contributor

Is there a way to check for the 6 images being found more directly than waiting for 3 seconds? What if it sometimes took 4 seconds, or only 1 second (in which case you'd be blocking longer than needed)?

Contributor Author

done
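One way to avoid the fixed sleep is to poll for the condition with a deadline. A minimal sketch in plain Scala (the helper name `waitFor` and the timeout values are illustrative, not from this PR):

```scala
// Poll `condition` every `intervalMs` until it holds or `timeoutMs` elapses.
// Returns true if the condition became true before the deadline.
def waitFor(timeoutMs: Long, intervalMs: Long = 100)(condition: => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (System.currentTimeMillis() < deadline) {
    if (condition) return true
    Thread.sleep(intervalMs)
  }
  condition
}
```

In the test this would wrap a check such as counting rows in the `images` query's sink, unblocking as soon as the six images arrive instead of always sleeping for three seconds.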

// Copyright (C) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License. See LICENSE in project root for information.

package org.apache.spark.sql.execution.datasources.binary
Contributor

consider putting this in org.apache.spark.image package


inputStream = fs.open(file)
rng.setSeed(filename.hashCode.toLong)
if (inspectZip) {
Contributor Author

no nested ifs

Contributor Author

done

class HadoopFileReader(file: PartitionedFile, conf: Configuration, subsample: Double, inspectZip: Boolean)
extends Iterator[BytesWritable] with Closeable {

Logger.getRootLogger.warn("reading " + file.filePath)
Contributor Author

take out

filteredPaths.map(_.getPath) ++ filteredDirs.flatMap(p => recursePath(fileSystem, p, pathFilter))
}

def streamUnstructured(ssc: StreamingContext, directory: String): InputDStream[(String, BytesWritable)] = {
Contributor

consider removing support for non-structured stream

Contributor Author

done

Contributor

@drdarshan drdarshan left a comment

Almost there... please also add a sample notebook, since this is a pretty epic change.

@mhamilton723 mhamilton723 force-pushed the streaming branch 2 times, most recently from 234dc65 to 3883acf Compare September 22, 2017 22:20
@mmlspark-bot
Contributor

Pass! — The build has succeeded.

MMLSpark 0.8.dev1+7.ge8535c7

This is a new version build!

This is a build for github PR #134, changes:


  • Source: mhamilton723/mmlspark,
    streaming at revision e8535c7
    (by Mark Hamilton marhamil@microsoft.com).

  • Build: MMLSpark, 256419
    (built by elbarzil-vm on elbarzil-vm, 2017-09-23 16:08)

  • Info:
    0.8.dev1+7.ge8535c7: mhamilton723/mmlspark/streaming@e8535c7b; MMLSpark#256419

  • Queued by:
    Mark Hamilton for Mark Hamilton

  • Maven package uploaded, use --packages com.microsoft.ml.spark:mmlspark_2.11:0.8.dev1+7.ge8535c7 and --repositories https://mmlspark.azureedge.net/maven.

  • PIP package uploaded.

  • HDInsight: Copy the link to this Script Action to setup this build on a cluster.

  • Documentation uploaded.

@microsoft microsoft deleted a comment from mmlspark-bot Sep 24, 2017
@microsoft microsoft deleted a comment from mmlspark-bot Sep 24, 2017
@microsoft microsoft deleted a comment from mmlspark-bot Sep 24, 2017
@microsoft microsoft deleted a comment from mmlspark-bot Sep 24, 2017
@drdarshan drdarshan removed their assignment Sep 25, 2017
/**
* Thin wrapper class analogous to others in the spark ecosystem
*/
class HadoopFileReader(file: PartitionedFile, conf: Configuration, subsample: Double, inspectZip: Boolean)
Contributor Author

private class

Contributor Author

done

def isBinaryFile(df: DataFrame, col: String): Boolean =
df.schema(col).dataType == schema

def recursePath(fileSystem: FileSystem, path: Path, pathFilter: FileStatus => Boolean): Array[Path] = {
Contributor Author

check for symlinks
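Hadoop's FileStatus exposes an isSymlink check, so the recursion can skip links and avoid cycles. The same guard sketched with plain JDK file APIs, since the idea carries over directly (the helper name `recurseLocal` is illustrative):

```scala
import java.io.File
import java.nio.file.Files

// Recursively collect regular files under `f`, skipping symbolic links
// so that link cycles cannot cause infinite recursion.
def recurseLocal(f: File): Seq[File] = {
  if (Files.isSymbolicLink(f.toPath)) Seq.empty
  else if (f.isFile) Seq(f)
  else if (f.isDirectory) Option(f.listFiles()).toSeq.flatten.flatMap(recurseLocal)
  else Seq.empty
}
```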

* @param recursive Recursive search flag
* @return DataFrame with a single column of "binaryFiles", see "columnSchema" for details
*/
def read(path: String, recursive: Boolean, spark: SparkSession,
Contributor Author

make sure this works with python monkeypatch

Contributor Author

yep

case Some(row) =>
val imGenRow = new GenericInternalRow(1)
val genRow = new GenericInternalRow(ImageReader.columnSchema.fields.length)
genRow.update(0, UTF8String.fromString(row.getString(0)))
Contributor Author

comment to direct readers to image schema

* @return returns None if decompression fails
*/
private[spark] def decode(filename: String, bytes: Array[Byte]): Option[Row] = {
def decode(filename: String, bytes: Array[Byte]): Option[Row] = {
Contributor Author

put in right namespace and make private

Contributor Author

impossible

Contributor

OK!
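The decode helper's contract is to return None on failure rather than throw. A minimal sketch of that Option-on-failure pattern with javax.imageio (an illustration of the contract only, not the PR's implementation, which builds a Row in the image schema):

```scala
import java.awt.image.BufferedImage
import java.io.ByteArrayInputStream
import javax.imageio.ImageIO

// Attempt to decode raw bytes as an image. ImageIO.read returns null for
// unrecognized formats, so both that case and exceptions map to None.
def decodeImage(bytes: Array[Byte]): Option[BufferedImage] =
  try Option(ImageIO.read(new ByteArrayInputStream(bytes)))
  catch { case _: Exception => None }
```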

false
}
} else {
rng.setSeed(filename.hashCode.toLong)
Contributor Author

This RNG is used internally only, so it's not as if we are overriding a user-supplied RNG. I used this method because distributed, reproducible random splits are a very hard problem: we don't control the list of paths that we load in, Spark provides it for us. The random split also needs to be robust to the partitioning strategy, which makes a single RNG impossible, as it would depend on the ordering. Here I chose to make the RNG dependent on the filename, which is why it uses the filename as the seed. This allows for reproducibility, provided the filenames are the same. The randomness is also preserved because the seeds will all be different (provided there are no hash collisions), and when iterating through zip files the random seed is not reset every time. I realize now that the above setting of the seed seems redundant, but it is harmless, so I will remove it and rely on the seed setting in the init.

Yes, it's definitely a hack, but it's the least egregious hack I could think of and is fairly performant. I think the real way to do this might be to use the filters provided by the Catalyst optimizer, but that involves implementing an entire DSL of filters, and I would be more than happy to investigate that in a further PR.
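The filename-seeded scheme described above can be sketched as follows: each file's inclusion is decided by a Random seeded with its own name, so the decision is independent of partitioning and ordering, and repeatable across runs as long as the filenames are stable (the helper name `keepFile` is illustrative):

```scala
import scala.util.Random

// Decide whether to keep a file when taking a `subsample` fraction of the data.
// Seeding the RNG with the filename makes the decision deterministic per file,
// independent of how Spark partitions or orders the input paths.
def keepFile(filename: String, subsample: Double): Boolean = {
  val rng = new Random(filename.hashCode.toLong)
  rng.nextDouble() < subsample
}
```

The trade-off is exactly the one noted above: reproducibility holds only while filenames are stable, and distinct filenames that collide under hashCode will make identical keep/drop decisions.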

@mmlspark-bot
Contributor

Pass! — The build has succeeded.

MMLSpark 0.8.dev2+3.gccfbee2

This is a new version build!

This is a build for github PR #134, changes:


  • Source: mhamilton723/mmlspark,
    streaming at revision ccfbee2
    (by Mark Hamilton marhamil@microsoft.com).

  • Build: MMLSpark, 265844
    (built by elbarzil-vm on elbarzil-vm, 2017-09-28 18:36)

  • Info:
    0.8.dev2+3.gccfbee2: mhamilton723/mmlspark/streaming@ccfbee25; MMLSpark#265844

  • Queued by:
    Mark Hamilton for Mark Hamilton

  • Maven package uploaded, use --packages com.microsoft.ml.spark:mmlspark_2.11:0.8.dev2+3.gccfbee2 and --repositories https://mmlspark.azureedge.net/maven.

  • PIP package uploaded.

  • HDInsight: Copy the link to this Script Action to setup this build on a cluster.

  • Documentation uploaded.

@mhamilton723 mhamilton723 force-pushed the streaming branch 4 times, most recently from f28f3db to 544b32f Compare September 29, 2017 20:37
@microsoft microsoft deleted a comment from mmlspark-bot Sep 29, 2017
@mmlspark-bot
Contributor

Pass! — The build has succeeded.

MMLSpark 0.8.dev2+3.g544b32f

This is a new version build!

This is a build for github PR #134, changes:


  • Source: mhamilton723/mmlspark,
    streaming at revision 544b32f
    (by Mark Hamilton marhamil@microsoft.com).

  • Build: MMLSpark, 268665
    (built by elbarzil-vm on elbarzil-vm, 2017-09-29 20:39)

  • Info:
    0.8.dev2+3.g544b32f: mhamilton723/mmlspark/streaming@544b32f8; MMLSpark#268665

  • Queued by:
    Mark Hamilton for Mark Hamilton

  • Maven package uploaded, use --packages com.microsoft.ml.spark:mmlspark_2.11:0.8.dev2+3.g544b32f and --repositories https://mmlspark.azureedge.net/maven.

  • PIP package uploaded.

  • HDInsight: Copy the link to this Script Action to setup this build on a cluster.

  • Documentation uploaded.

drdarshan
drdarshan previously approved these changes Oct 3, 2017
Contributor

@drdarshan drdarshan left a comment

Thank you for working through all the comments!

@mmlspark-bot
Contributor

Pass! — The build has succeeded.

MMLSpark 0.8.dev6+4.g65a3635

This is a build for github PR #134, changes:


  • Source: mhamilton723/mmlspark,
    streaming at revision 65a3635
    (by Mark Hamilton marhamil@microsoft.com).

  • Build: MMLSpark, 273731
    (built by elbarzil-vm on elbarzil-vm, 2017-10-03 19:07)

  • Info:
    0.8.dev6+4.g65a3635: mhamilton723/mmlspark/streaming@65a36356; MMLSpark#273731

  • Queued by:
    Eli Barzilay for Eli Barzilay

  • Maven package uploaded, use --packages com.microsoft.ml.spark:mmlspark_2.11:0.8.dev6+4.g65a3635 and --repositories https://mmlspark.azureedge.net/maven.

  • PIP package uploaded.

  • HDInsight: Copy the link to this Script Action to setup this build on a cluster.

  • Documentation uploaded.

@elibarzilay elibarzilay merged commit 4f1077e into microsoft:master Oct 3, 2017
@mhamilton723 mhamilton723 deleted the streaming branch October 4, 2017 20:22